Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add get_size_on_disk method to RemoteData #6584

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

GeigerJ2
Copy link
Contributor

@GeigerJ2 GeigerJ2 commented Oct 15, 2024

Required for PR #6578.

By default, the get_size_on_disk method calls the method get_size_on_disk_du that uses du to obtain the total directory size in bytes. The output will eventually be formatted in a human-readable way. If the call to du fails for whatever reason, recursive lstat is being used, though, that is discouraged due to (copied from _get_size_on_disk_lstat docstring):

"Note that even if a file is only 1 byte, on disk, it still occupies one full disk block size. As such, getting accurate measures of the total expected size on disk when retrieving a RemoteData is not straightforward with lstat, as one would need to consider the occupied block sizes for each file, as well as repository metadata. Thus, this function only serves as a fallback in the absence of the du command."

I further extended the existing tests for RemoteData to use both, LocalTransport, as well as SshTransport, and test for the functionality added in this PR.

Pinging also @npaulish, as she's currently working on retrieving RemoteData objects.

Copy link

codecov bot commented Oct 15, 2024

Codecov Report

Attention: Patch coverage is 80.00000% with 11 lines in your changes missing coverage. Please review.

Project coverage is 77.91%. Comparing base (ef60b66) to head (ceb0673).
Report is 136 commits behind head on main.

Files with missing lines Patch % Lines
src/aiida/orm/nodes/data/remote/base.py 77.78% 10 Missing ⚠️
src/aiida/common/utils.py 90.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6584      +/-   ##
==========================================
+ Coverage   77.51%   77.91%   +0.41%     
==========================================
  Files         560      567       +7     
  Lines       41444    42200     +756     
==========================================
+ Hits        32120    32875     +755     
- Misses       9324     9325       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.


🚨 Try these New Features:

@GeigerJ2 GeigerJ2 changed the title Add get_total_size_on_disk method to RemoteData. Add get_total_size_on_disk method to RemoteData Oct 17, 2024
@giovannipizzi
Copy link
Member

This is maybe machine-dependent, but rather than going via our API (that is more robust, but definitely going to be slower, I think) have a first "fast" option just running du -s and parsing the output (but careful about units! E.g. it uses "blocks", on some systems it's 512 or 2048bytes!! And if it fails, fall back to your solution?

@GeigerJ2
Copy link
Contributor Author

GeigerJ2 commented Oct 17, 2024

Note to self to run du via exec_command_wait method from transport.

@GeigerJ2 GeigerJ2 force-pushed the feature/remote-data-total-size branch from e801a78 to e0be575 Compare November 21, 2024 14:54
@GeigerJ2 GeigerJ2 marked this pull request as ready for review November 21, 2024 14:54
@GeigerJ2 GeigerJ2 changed the title Add get_total_size_on_disk method to RemoteData Add get_size_on_disk method to RemoteData Nov 21, 2024
@GeigerJ2 GeigerJ2 force-pushed the feature/remote-data-total-size branch from e0be575 to b1a0560 Compare November 21, 2024 15:14
Copy link
Contributor

@khsrali khsrali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @GeigerJ2 please go through more pains that I imposed 😈

src/aiida/common/utils.py Show resolved Hide resolved
src/aiida/orm/nodes/data/remote/base.py Outdated Show resolved Hide resolved
src/aiida/orm/nodes/data/remote/base.py Show resolved Hide resolved
:rtype: int
"""

retval, stdout, stderr = transport.exec_command_wait(f'du --bytes {full_path}')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
retval, stdout, stderr = transport.exec_command_wait(f'du --bytes {full_path}')
retval, stdout, stderr = transport.exec_command_wait(f'du -h {full_path}')

that already returns the human readable one, why not?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least for testing and validation, I think returning the bytes is more convenient. I'd convert it to the human-readable format only at the last step, when printing it to the user.

Helper function for recursive directory traversal to obtain the `listdir_withattributes` result for all
subdirectories.

:param path: Path to be traversed.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

absolute or relative?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's make it absolute then ;)

for item in contents:
item_path = os.path.join(path, item['name'])
if item['isdir']:
# Include size of direcotry
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused here, is the size of a directory different from it's contents?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the size returned from this call is the size of the directory inode, which includes the metadata of its (file) contents, but not the actual, total (file) content size. Very annoying indeed.

src/aiida/orm/nodes/data/remote/base.py Outdated Show resolved Hide resolved

return format_directory_size(size_in_bytes=total_size)

def _get_size_on_disk_du(self, full_path: Path, transport: 'Transport') -> int:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this whole function, doesn't really need to be separated I feel.. It's basically only executes a command, and you don't re-use it, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found the design to have just one top-level method get_size_on_disk nice, which tries to call the two private methods, in terms of separation of concerns.

Though, it is true that this pollutes the API of RemoteData somewhat... So I'm also fine of either moving the private methods to some utils module, or merging them. Maybe @unkcpz can comment on good coding practices here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which tries to call the two private methods, in terms of separation of concerns.

I'm not sure if I understand, sorry 😬
Since it's not really re-usable by other method, I'd vote for merging it, and avoid over-fictionalizing

:rtype: int
"""

retval, stdout, stderr = transport.exec_command_wait(f'du --bytes {full_path}')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one more concern, is the aiida-firecrest plugin..
they don't support command execution, so it's better we avoid adding more calls to exec_command_wait in the code base, wherever it's not absolutely crucial...
I mean it adds maintenance overheads.. in future somebody will open an issue and PR to change this..

Also your nice _get_size_on_disk_lstat function is already addressing this functionality, so du doesn't seem super crucial

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, I can add a check on the transport-type before, to make sure it stays compatible with FirecREST in the future. Though, I wouldn't remove the convenient and preferred implementation for now in anticipation of FirecREST eventually becoming the required transport mechanism for CSCS.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, please! Makes sense to add a except NotImplementedError on line 232, in case exec_command_wait is not implemented.

@pytest.fixture
def remote_data_ssh(tmp_path, aiida_computer_ssh):
"""Return a non-empty ``RemoteData`` instance."""
# Compared to `aiida_localhost`, `aiida_computer_ssh` doesn't return an actual `Computer`, but just a factory
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🙉 aiida_computer_ssh should be compared with aiida_computer_local, which in that sense they are similar.. the issue is we don't have "aiida_ssh" that would return the actual computer instance, lol 🙉

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, exactly. We could add a fixture that actually calls the factory and returns the Computer instance, though I'd rather call it aiida_localhost_ssh, and don't see the need for it right now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yews, so maybe please fix this comment, which I found confusing:

Suggested change
# Compared to `aiida_localhost`, `aiida_computer_ssh` doesn't return an actual `Computer`, but just a factory
# `aiida_computer_ssh` doesn't return an actual `Computer`, but just a factory

Copy link
Contributor Author

@GeigerJ2 GeigerJ2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure why my replies via the files GH tab appear as a review, but whatever... ^^

Thanks for the review @khsrali. I implemented most of your proposed changes!

As to our in-person discussion if du or lstat is preferred, I think none of the two is ideal... lstat giving the actual byte-sized content, which is neat and all, but won't correspond to the disk space that will actually be occupied locally by a file, due to the use of blocks for the file system. And du giving the actual occupied disk space on the remote, which, however, might be different from the local file system (due to having a different file system on the local machine, different formatting, different block size, etc.). Hence, the big difference in the file size check in the test. For real-world use cases, with more and larger files, the difference is likely much smaller, and won't matter too much, I think. I'll add a test for a larger file, as well as modify the message that the given size is just an estimate, so that user are aware they should take the value with a grain of salt.

Maybe also @agoscinski with his actual computer science background can weigh in 🫶

src/aiida/orm/nodes/data/remote/base.py Outdated Show resolved Hide resolved

return format_directory_size(size_in_bytes=total_size)

def _get_size_on_disk_du(self, full_path: Path, transport: 'Transport') -> int:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found the design to have just one top-level method get_size_on_disk nice, which tries to call the two private methods, in terms of separation of concerns.

Though, it is true that this pollutes the API of RemoteData somewhat... So I'm also fine of either moving the private methods to some utils module, or merging them. Maybe @unkcpz can comment on good coding practices here?

:rtype: int
"""

retval, stdout, stderr = transport.exec_command_wait(f'du --bytes {full_path}')
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least for testing and validation, I think returning the bytes is more convenient. I'd convert it to the human-readable format only at the last step, when printing it to the user.

:rtype: int
"""

retval, stdout, stderr = transport.exec_command_wait(f'du --bytes {full_path}')
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, I can add a check on the transport-type before, to make sure it stays compatible with FirecREST in the future. Though, I wouldn't remove the convenient and preferred implementation for now in anticipation of FirecREST eventually becoming the required transport mechanism for CSCS.

for item in contents:
item_path = os.path.join(path, item['name'])
if item['isdir']:
# Include size of direcotry
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the size returned from this call is the size of the directory inode, which includes the metadata of its (file) contents, but not the actual, total (file) content size. Very annoying indeed.

src/aiida/orm/nodes/data/remote/base.py Outdated Show resolved Hide resolved
Helper function for recursive directory traversal to obtain the `listdir_withattributes` result for all
subdirectories.

:param path: Path to be traversed.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's make it absolute then ;)

GeigerJ2 and others added 6 commits November 25, 2024 16:41
In addition, extend the tests for `RemoteData` in `test_remote.py`
for the methods added in this PR, as well as parametrize them to run on
a `RemoteData` via local and ssh transport.
In addition, extend the tests for `RemoteData` in `test_remote.py`
for the methods added in this PR, as well as parametrize them to run on
a `RemoteData` via local and ssh transport.
@GeigerJ2 GeigerJ2 force-pushed the feature/remote-data-total-size branch from f2f90f9 to 4fcb1ba Compare November 25, 2024 15:41
Copy link
Contributor

@khsrali khsrali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot @GeigerJ2! Just a few minor comments..

src/aiida/common/utils.py Show resolved Hide resolved
:rtype: int
"""

retval, stdout, stderr = transport.exec_command_wait(f'du --bytes {full_path}')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, please! Makes sense to add a except NotImplementedError on line 232, in case exec_command_wait is not implemented.

@pytest.fixture
def remote_data_ssh(tmp_path, aiida_computer_ssh):
"""Return a non-empty ``RemoteData`` instance."""
# Compared to `aiida_localhost`, `aiida_computer_ssh` doesn't return an actual `Computer`, but just a factory
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yews, so maybe please fix this comment, which I found confusing:

Suggested change
# Compared to `aiida_localhost`, `aiida_computer_ssh` doesn't return an actual `Computer`, but just a factory
# `aiida_computer_ssh` doesn't return an actual `Computer`, but just a factory

return format_directory_size(size_in_bytes=total_size)

def _get_size_on_disk_du(self, full_path: Path, transport: 'Transport') -> int:
"""Connects to the remote folder and returns the total size of all files in the directory recursively in bytes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"""Connects to the remote folder and returns the total size of all files in the directory recursively in bytes
"""Connects to the remote folder and returns the total size of all files in the directory in bytes

:param transport: Open transport instance
:type transport: Transport
:raises RuntimeError: When `du` command cannot be successfully executed
:return: Total size of directory recursively in bytes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I found the term recursively confusing.. (thought the function does yield )

Suggested change
:return: Total size of directory recursively in bytes.
:return: Total size of directory in bytes (including all it's contents)

:param transport: Open transport instance.
:type transport: Transport
:raises RuntimeError: When `du` command cannot be successfully executed.
:return: Total size of directory recursively in bytes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the same:

Suggested change
:return: Total size of directory recursively in bytes.
:return: Total size of directory in bytes (including all it's contents)


return format_directory_size(size_in_bytes=total_size)

def _get_size_on_disk_du(self, full_path: Path, transport: 'Transport') -> int:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which tries to call the two private methods, in terms of separation of concerns.

I'm not sure if I understand, sorry 😬
Since it's not really re-usable by other method, I'd vote for merging it, and avoid over-fictionalizing

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants